Red wine is one of the most beautiful drinks, so it’s going to be interesting to find out what makes a good wine! :)


Contents

  1. Data set
  2. Exploring data
  3. Univariate plots
  4. Analyzing quality
        4.1 Correlations with other variables
        4.2 Bivariate plots with other variables
  5. What chemical properties influence the quality
        5.1 Chemical properties correlation table
        5.2 Chemical properties bivariate plots
  6. Building linear regression model
  7. Final plots and summary
  8. Reflection
  9. References
  10. Author and contact information

Data set

The data can be downloaded from this link; you can also find it on my GitHub along with the other report resources: link.

Also read this text file, which describes the variables and how the data was collected.

The data set contains 11 chemical characteristics plus a quality score from 0 to 10, based on at least 3 wine experts, for 1599 different wines!


Exploring data

wine <- read.csv('./data/wineQualityReds.csv')

The data has 1599 observations of 13 variables.

str(wine)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Input variables (based on physicochemical tests):
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)

Output variable (based on sensory data):
12. quality (score between 0 and 10)


Univariate plots

A closer look at the single-variable plots.

  • The red lines represent the 25% and 75% quantiles (i.e. 25% of the data lies to the left of the first line), and the blue line represents the 50% quantile (the median).
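
As a minimal sketch of how such a plot could be drawn in base R (the report's own plotting code is not shown, so this is an assumption and not the original code), the helper below draws a histogram and marks the three quantiles. The name `plot_with_quartiles` is made up for illustration, and the demo uses only the first ten alcohol values from the str() output above.

```r
# Draw a histogram of one variable and mark its quartiles.
# `plot_with_quartiles` is a hypothetical helper name, not from the report.
plot_with_quartiles <- function(x, label) {
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  hist(x, main = label, xlab = label, col = "grey85", border = "white")
  abline(v = q[c(1, 3)], col = "red", lwd = 2)   # 25% and 75% quantiles
  abline(v = q[2], col = "blue", lwd = 2)        # 50% quantile (median)
  invisible(q)
}

# First ten alcohol values from the str() output, used only as a demo.
alcohol_head <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quartiles <- plot_with_quartiles(alcohol_head, "alcohol (% by volume)")
print(quartiles)
```

With the full data, the same call would be `plot_with_quartiles(wine$alcohol, "alcohol")`, repeated for each column.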


Analyzing quality

table(wine$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

82.5% of the wines have a quality of either 5 or 6.
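
The percentage can be checked directly from the counts printed by table(). This is only a worked version of the arithmetic; the variable names are mine.

```r
# Counts copied from the table(wine$quality) output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)                              # 1599 wines
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / total  # (681 + 638) / 1599
print(round(100 * share_5_or_6, 1))                       # 82.5
```

On the data frame itself, `mean(wine$quality %in% c(5, 6))` gives the same share in one step.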

Correlations with other variables

Let’s zoom in on the correlation between quality and the chemical characteristics:

variable              Pearson corr
fixed.acidity          0.12
volatile.acidity      -0.39
citric.acid            0.23
residual.sugar         0.01
chlorides             -0.13
free.sulfur.dioxide   -0.05
total.sulfur.dioxide  -0.19
density               -0.17
pH                    -0.06
sulphates              0.25
alcohol                0.48

As we can see, the only relatively good correlation is with the alcohol percentage (0.48), followed by a negative one with volatile acidity (-0.39).
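
A column of correlations like this one can be computed in a few lines. The sketch below is one possible way, not necessarily how the report did it. `cor_with_quality` is a hypothetical helper, and the demo runs on three columns of the first ten rows from the str() output, so its numbers will not match the full-data table.

```r
# Pearson correlation of every predictor with quality.
# `cor_with_quality` is a hypothetical helper; it assumes a data frame with
# a `quality` column and, optionally, the row-index column `X`.
cor_with_quality <- function(df) {
  predictors <- setdiff(names(df), c("X", "quality"))
  r <- sapply(df[predictors],
              function(col) cor(col, df$quality, method = "pearson"))
  round(sort(r, decreasing = TRUE), 2)
}

# Demo on the first ten rows shown in the str() output (three columns only).
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
head_cor <- cor_with_quality(wine_head)
print(head_cor)
```

Calling `cor_with_quality(wine)` on the full data frame would give one value per chemical property.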

Bivariate plots with other variables

The scatter plots below, between quality and each of the characteristics, confirm the correlation values.

  • The black line represents the mean.

What chemical properties influence the quality

Now it’s time to ask our question:

Which chemical characteristics influence the quality, or is there any relation between them at all?

Logic says yes, but the correlations and graphs say the opposite.

Let’s think about it a different way: instead of searching for a direct relation between each characteristic and quality, let’s find the relations among the chemical characteristics themselves.

Chemical properties correlation table

The correlation table below is a good way to find these relations.

The correlations are computed with both the Pearson and Spearman methods, so each entry in the table is structured as: Pearson’s / Spearman’s.

Correlations greater than 0.3 or less than -0.3 are colored in red.
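
A table of this form could be produced roughly as follows. This is a sketch under my own assumptions, not the report's code: `paired_cor_table` and `strong_pairs` are hypothetical helper names, and the demo uses three columns from the first ten rows of the str() output, so its values differ from the full table.

```r
# Lower-triangle table of "Pearson / Spearman" strings.
# `paired_cor_table` and `strong_pairs` are hypothetical helper names.
paired_cor_table <- function(df) {
  p <- round(cor(df, method = "pearson"), 2)
  s <- round(cor(df, method = "spearman"), 2)
  out <- matrix(paste(p, "/", s), nrow = nrow(p), dimnames = dimnames(p))
  diag(out) <- "1"
  out[upper.tri(out)] <- ""
  out
}

# Pairs where either coefficient is beyond the threshold in absolute value.
strong_pairs <- function(df, threshold = 0.3) {
  p <- cor(df, method = "pearson")
  s <- cor(df, method = "spearman")
  hit <- which((abs(p) > threshold | abs(s) > threshold) & lower.tri(p),
               arr.ind = TRUE)
  data.frame(var1     = rownames(p)[hit[, "row"]],
             var2     = colnames(p)[hit[, "col"]],
             pearson  = round(p[hit], 2),
             spearman = round(s[hit], 2))
}

# Demo on the first ten rows shown in the str() output (three columns only).
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)
tab <- paired_cor_table(wine_head)
print(tab, quote = FALSE)
strong <- strong_pairs(wine_head)
print(strong)
```

`strong_pairs()` stands in for the red coloring: it lists the pairs that pass the 0.3 threshold under either method.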

                          (1)            (2)            (3)            (4)            (5)            (6)            (7)            (8)            (9)            (10)           (11)
 1. fixed.acidity         1
 2. volatile.acidity      -0.26 / -0.28  1
 3. citric.acid           0.67 / 0.66    -0.55 / -0.61  1
 4. residual.sugar        0.11 / 0.22    0 / 0.03       0.14 / 0.18    1
 5. chlorides             0.09 / 0.25    0.06 / 0.16    0.2 / 0.11     0.06 / 0.21    1
 6. free.sulfur.dioxide   -0.15 / -0.18  -0.01 / 0.02   -0.06 / -0.08  0.19 / 0.07    0.01 / 0       1
 7. total.sulfur.dioxide  -0.11 / -0.09  0.08 / 0.09    0.04 / 0.01    0.2 / 0.15     0.05 / 0.13    0.67 / 0.79    1
 8. density               0.67 / 0.62    0.02 / 0.03    0.36 / 0.35    0.36 / 0.42    0.2 / 0.41     -0.02 / -0.04  0.07 / 0.13    1
 9. pH                    -0.68 / -0.71  0.23 / 0.23    -0.54 / -0.55  -0.09 / -0.09  -0.27 / -0.23  0.07 / 0.12    -0.07 / -0.01  -0.34 / -0.31  1
10. sulphates             0.18 / 0.21    -0.26 / -0.33  0.31 / 0.33    0.01 / 0.04    0.37 / 0.02    0.05 / 0.05    0.04 / 0       0.15 / 0.16    -0.2 / -0.08   1
11. alcohol               -0.06 / -0.07  -0.2 / -0.22   0.11 / 0.1     0.04 / 0.12    -0.22 / -0.28  -0.07 / -0.08  -0.21 / -0.26  -0.5 / -0.46   0.21 / 0.18    0.09 / 0.21    1

From the above table we can conclude the following:

Fixed acidity is correlated with citric acid, density and pH.
Volatile acidity is correlated with citric acid and sulphates.
Citric acid is correlated with volatile acidity, fixed acidity, pH and sulphates.
Chlorides is correlated with density and sulphates.
Density is correlated with fixed acidity, alcohol, residual sugar and chlorides.
pH is correlated with fixed acidity and citric acid.
Sulphates is correlated with volatile acidity, citric acid and chlorides.
Residual sugar is correlated with density.
Alcohol is correlated with density.

The table also puts density with citric acid (0.36 / 0.35), density with pH (-0.34 / -0.31) and free with total sulfur dioxide (0.67 / 0.79) past the 0.3 threshold; these pairs are not among the relations used in the model later on.


And from that we get this tree:

So we have 7 parent nodes that have children:
  Quality, Alcohol, Density, Fixed acidity, Chlorides, Citric acid and Volatile acidity.

All of them depend on each other: as we know, alcohol affects quality, alcohol is affected by density, which is affected by other chemicals, which are affected in turn, and so on.

Taking the signs of the correlations into account, the quality value increases when the following happen:

[Diagram: one drawing per property, in this order: volatile acidity, pH, sulphates; citric acid, pH, sulphates; fixed acidity, residual sugar, chlorides; density; alcohol; quality.]


Let’s go back to our question: WHAT CHEMICAL PROPERTIES INFLUENCE THE QUALITY?

To answer that, we must go through the earlier tree from the bottom to the top.

Chemical properties bivariate plots

The plots below explain that: the first plot has quality as Y (the dependent variable), then the next variable in the tree becomes the Y of the next plot, and so on.

  • The black line represents the line of best fit (linear model).
  • The purple line represents the mean.
  • The two blue lines represent the first and third quartiles.
  • The black points are the data.
  • Density is multiplied by 1000 to convert it to the same unit as the other variables (g/dm^3).


Building linear regression model

Having seen how quality relates to the chemical properties, let’s build a regression model so that, given the chemical properties of a wine, we can predict its quality.

Linear regression uses one or more independent variables to predict the outcome of a dependent variable.

We will use the formula Y ~ X, where Y is quality and X is the set of relations shown above in the tree. Each related pair enters the formula as a * b, which R expands to both variables plus their interaction (a + b + a:b).
Because the variables are on different scales, it is nicer to bring them all to the same scale. I’ll choose the scale from 0 to 10, so every value of each variable lies between 0 and 10. A linear rescaling of this kind changes the means and spreads but keeps the shape of each distribution and the correlations between variables.
The rescaled data is stored in a new variable called ‘wine.ratio’.
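
The report does not show how ‘wine.ratio’ was built. One common way to put every variable on a 0 to 10 scale is min-max rescaling, sketched below as an assumption. The helper names `rescale_0_10` and `make_wine_ratio` are hypothetical, and leaving `X` and `quality` unscaled is my guess, not something the report states.

```r
# Min-max rescaling of a numeric vector to the range [0, 10].
# `rescale_0_10` and `make_wine_ratio` are hypothetical helper names.
rescale_0_10 <- function(x) {
  (x - min(x)) / (max(x) - min(x)) * 10
}

# Rescale every column except those listed in `keep`.
# Keeping `X` and `quality` as they are is an assumption.
make_wine_ratio <- function(wine, keep = c("X", "quality")) {
  out <- wine
  cols <- setdiff(names(wine), keep)
  out[cols] <- lapply(wine[cols], rescale_0_10)
  out
}

# Demo on the first ten rows shown in the str() output (two predictors only).
wine_head <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine_ratio_head <- make_wine_ratio(wine_head)
print(range(wine_ratio_head$alcohol))
print(cor(wine_head$alcohol, wine_head$sulphates))
print(cor(wine_ratio_head$alcohol, wine_ratio_head$sulphates))
```

The two correlations printed at the end are equal, because a linear rescaling with a positive slope leaves Pearson correlations unchanged.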

Now let’s look at the model:

reg_lm <- lm(quality ~
               alcohol * density +
               density * fixed.acidity +
               density * residual.sugar +
               density * chlorides +
               chlorides * sulphates +
               fixed.acidity * pH +
               fixed.acidity * citric.acid +
               citric.acid * pH +
               citric.acid * volatile.acidity +
               citric.acid * sulphates +
               volatile.acidity * sulphates,
             data = wine.ratio)

# print some information about the model (mtable() comes from the memisc package)
library(memisc)
mtable(reg_lm, sdigits = 3)
## 
## Calls:
## reg_lm: lm(formula = quality ~ alcohol * density + density * fixed.acidity + 
##     density * residual.sugar + density * chlorides + chlorides * 
##     sulphates + fixed.acidity * pH + fixed.acidity * citric.acid + 
##     citric.acid * pH + citric.acid * volatile.acidity + citric.acid * 
##     sulphates + volatile.acidity * sulphates, data = wine.ratio)
## 
## =============================================
##   (Intercept)                      5.230***  
##                                   (0.359)    
##   alcohol                          0.205***  
##                                   (0.040)    
##   density                         -0.012     
##                                   (0.066)    
##   fixed.acidity                   -0.013     
##                                   (0.091)    
##   residual.sugar                  -0.081     
##                                   (0.061)    
##   chlorides                        0.133     
##                                   (0.117)    
##   sulphates                        0.217**   
##                                   (0.069)    
##   pH                              -0.028     
##                                   (0.042)    
##   citric.acid                      0.110     
##                                   (0.085)    
##   volatile.acidity                -0.202***  
##                                   (0.038)    
##   alcohol x density               -0.003     
##                                   (0.007)    
##   density x fixed.acidity         -0.006     
##                                   (0.009)    
##   density x residual.sugar         0.015     
##                                   (0.009)    
##   density x chlorides             -0.020     
##                                   (0.022)    
##   chlorides x sulphates           -0.036*    
##                                   (0.014)    
##   fixed.acidity x pH               0.031*    
##                                   (0.013)    
##   fixed.acidity x citric.acid     -0.006     
##                                   (0.010)    
##   pH x citric.acid                -0.036**   
##                                   (0.013)    
##   citric.acid x volatile.acidity   0.015     
##                                   (0.009)    
##   sulphates x citric.acid         -0.001     
##                                   (0.012)    
##   sulphates x volatile.acidity    -0.004     
##                                   (0.019)    
## ---------------------------------------------
##   R-squared                           0.362  
##   adj. R-squared                      0.354  
##   sigma                               0.649  
##   F                                  44.847  
##   p                                   0.000  
##   Log-likelihood                  -1566.812  
##   Deviance                          664.474  
##   AIC                              3177.623  
##   BIC                              3295.920  
##   N                                1599      
## =============================================
  • This model explains about 36% of the variation in quality (R-squared = 0.362).
  • About 95% of the observed quality scores should fall within roughly ±1.3 points of the fitted values (2 × sigma = 2 × 0.649 = 1.298).
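
The two numbers quoted above come from the model summary: R-squared, and the residual standard error (sigma) doubled to give a rough 95% band. The sketch below shows where those values live on a fitted lm object. It uses a toy fit on the first ten rows of the str() output, so only the last two lines relate to the reported model.

```r
# Read R-squared and the residual standard error (sigma) off a fitted model.
# The toy fit uses the first ten rows from the str() output, so its numbers
# are illustrative and unrelated to the full model.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
toy_fit <- lm(quality ~ alcohol, data = wine_head)
toy_summary <- summary(toy_fit)

print(toy_summary$r.squared)  # share of the variance in quality explained
print(toy_summary$sigma)      # residual standard error, in quality points

# Rough 95% band around the fitted values: about two standard errors.
# With the sigma reported by mtable() above:
reported_sigma <- 0.649
band_half_width <- 2 * reported_sigma
print(band_half_width)        # 1.298 quality points
```

For the report's model, `summary(reg_lm)$sigma` would return the same sigma that mtable() prints.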

Final plots and summary

I chose three plots to summarize the analysis:

The first one is a visualization of the regression model.
I’ll draw box plots for the formula Y ~ X, with Y (wine quality) as a factor on the x-axis and X, the relations between the chemical properties shown above, on the y-axis.
I’ll use the new data set wine.ratio here.

As the plot shows, the mean of X gets higher as quality gets higher for quality levels 3, 5, 6 and 7. Levels 4 and 8 are exceptions: the mean of X at quality 4 is lower than the mean at 3, and the mean at quality 8 is lower than the mean at quality 7.
Still, we can say that as quality increases, X tends to increase.


The second one shows the difference between the actual quality and the quality predicted by the regression model. Let’s first make a new column in the data called “quality.predicted”, which holds the predictions of the regression model rounded to the nearest whole number.

wine$quality.predicted <- round( predict(reg_lm, wine.ratio ) )

Now let’s plot it:

The bars show the number of wines at each quality level.
The red bars are the actual quality, and the blue bars are the predicted quality.
Most of the predicted values are 5 and 6, with a few 7s. The model did not predict any wine at quality 3, 4 or 8.
Instead, it predicted 5 and 6 more often than they actually occur.
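
The counts behind a bar chart like this can be tabulated side by side. The sketch below is one way to do it, under my own assumptions: `compare_quality_counts` is a hypothetical helper, and the demo fits a toy model on the first ten rows of the str() output, so it does not reproduce the report's numbers.

```r
# Count wines at each quality level, actual vs predicted.
# `compare_quality_counts` is a hypothetical helper name.
compare_quality_counts <- function(actual, predicted, levels = 3:8) {
  count_levels <- function(v) {
    setNames(vapply(levels, function(l) sum(v == l), integer(1)), levels)
  }
  rbind(actual = count_levels(actual), predicted = count_levels(predicted))
}

# Demo on the first ten rows shown in the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
toy_fit <- lm(quality ~ alcohol + volatile.acidity, data = wine_head)
toy_pred <- round(predict(toy_fit, wine_head))
cmp <- compare_quality_counts(wine_head$quality, toy_pred)
print(cmp)
```

With the report's objects, the call would be `compare_quality_counts(wine$quality, wine$quality.predicted)`, and `barplot(cmp, beside = TRUE)` would draw the paired bars.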


The last one shows the distribution (density plot) of each chemical property at each quality level from 3 to 8.

Sulphates has low values at quality levels 3, 4, 5 and 6, and slightly higher values at 7 and 8.
For chlorides, most values at quality level 8 are 1, and as the quality level goes down, the number of rows with value 1 decreases.
Residual sugar has low values at all quality levels.
Fixed acidity has most of its values under 3 at the low quality levels 3, 4 and 5, but at quality levels 6, 7 and 8 its values are spread out.
The other chemicals are spread across the graph at all quality levels.


Summary

We started by wondering about the relation between the quality of wine and its chemical properties. It seems clear that there must be a relation, but although we found one, it is weak and we can’t count on it.

So how does this make sense? If we trust that the chemical tests were accurate and there is no error in the data, then the error must lie in the human factor. Let’s not forget that the quality scores were entered by humans, and humans make mistakes.

So I believe, to some degree, that many of the quality scores reflect personal taste rather than actual quality.


Reflection

So how do we really get at the relation between a wine’s quality and its chemical properties?

I don’t believe that diving deeper into this data set would give me the answer. To get it, we would have to find a better data set, one that might contain prices, brands, and more accurate quality scores or drinkers’ reviews.

Also, chemical properties aren’t everything that matters in wine: there is still the type of grape used, the quality of the wine brand, any added flavors, and other things that haven’t been considered in the data set.

Another thing: the fact that most of the quality values are 5 or 6 makes the data harder to analyze. There are no very good wines (quality 9 or 10) and no very bad wines (quality 0, 1 or 2), which also suggests that the data isn’t strong enough to rely on and, as I said, humans make mistakes.


References

The data set used in this report:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: Elsevier, pre-press (pdf), bib.


Author and contact information

This analysis was done by a student of Udacity’s Data Analyst Nanodegree program as a course project.

Github: https://github.com/bekaa
LinkedIn: https://eg.linkedin.com/in/khaled-salah-48360590
WordPress: https://khaledsalahblog.wordpress.com
Email: sci.kd.eg@gmail.com